Design and Analysis of a Hardware CNN Accelerator
Authors
Abstract
In recent years, Convolutional Neural Networks (CNNs) have revolutionized computer vision tasks. However, inference in current CNN designs is extremely computationally intensive. This has led to an explosion of new accelerator architectures designed to reduce power consumption and latency [20]. In this paper, we design and implement a systolic-array-based architecture, which we call ConvAU, to efficiently accelerate the dense matrix multiplication operations in CNNs. We also train an 8-bit quantized version of SqueezeNet [14] and evaluate our accelerator's power consumption and throughput. Finally, we compare our results to the reported results for the NVIDIA K80 GPU and Google's TPU. We find that ConvAU gives a 200x improvement in TOPs/W over the NVIDIA K80 GPU and a 1.9x improvement over the TPU.
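To make the 8-bit quantization and dense matrix multiplication steps concrete, the sketch below is a minimal NumPy illustration, not the ConvAU implementation: it assumes symmetric per-tensor scaling and int32 accumulation (as is typical for MAC arrays); the function names are our own.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor 8-bit quantization: x ~ scale * q, with q in [-127, 127]."""
    max_abs = np.max(np.abs(x))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """Dense matmul on int8 operands with int32 accumulation, then dequantize."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # int32 accumulators
    return acc.astype(np.float32) * (sa * sb)

# Compare the quantized result against the float reference.
a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
print("max abs error:", np.abs(int8_matmul(a, b) - a @ b).max())
```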
Similar resources
Sparsity Analysis of Deep Learning Models and Corresponding Accelerator Design on FPGA
Machine learning has achieved great success in recent years, especially deep learning algorithms based on artificial neural networks. However, these models require high performance and large memory, which makes them unsuitable for IoT devices, since IoT devices have limited performance and should be low-cost and energy-efficient. Therefore, it is necessary to optimize the deep l...
Compiling Deep Learning Models for Custom Hardware Accelerators
Convolutional neural networks (CNNs) are the core of most state-of-the-art deep learning algorithms specialized for object detection and classification. CNNs are both computationally complex and embarrassingly parallel, two properties that leave room for potential software and hardware optimizations for embedded systems. Given a programmable hardware accelerator with a CNN-oriented custom instr...
On-Chip CNN Accelerator for Image Super-Resolution
To implement convolutional neural networks (CNNs) in hardware, state-of-the-art CNN accelerators pipeline the computation and data transfer stages through an off-chip memory and execute them simultaneously on the same timeline. However, since a large amount of feature maps generated during operation must be transferred to the off-chip memory, the pipeline stage length is determined by the of...
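The pipelining idea described above is essentially double buffering: the transfer of the next tile of data overlaps with the computation on the current one. The following is a small illustrative sketch of that scheduling pattern (not the paper's design; fetch_tile and compute_tile are hypothetical stand-ins for the off-chip DMA and on-chip compute stages):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fetch_tile(i):
    """Stand-in for an off-chip DMA read of feature-map tile i (hypothetical)."""
    return np.random.randn(64, 64).astype(np.float32)

def compute_tile(tile, weights):
    """Stand-in for the on-chip compute stage (here just a matmul)."""
    return tile @ weights

def pipelined_inference(num_tiles, weights):
    """Double buffering: prefetch tile i+1 while tile i is being computed."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        next_tile = io.submit(fetch_tile, 0)
        for i in range(num_tiles):
            tile = next_tile.result()                      # wait for the in-flight transfer
            if i + 1 < num_tiles:
                next_tile = io.submit(fetch_tile, i + 1)   # start the next transfer early
            results.append(compute_tile(tile, weights))    # compute overlaps the transfer
    return results

weights = np.random.randn(64, 32).astype(np.float32)
out = pipelined_inference(4, weights)
print(len(out), out[0].shape)
```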
Computation Error Analysis of Block Floating Point Arithmetic Oriented Convolution Neural Network Accelerator Design
The heavy burdens of computation and off-chip traffic impede deploying large-scale convolutional neural networks on embedded platforms. Because CNNs exhibit strong tolerance to computation errors, employing block floating point (BFP) arithmetic in CNN accelerators can efficiently reduce hardware cost and data traffic while maintaining classification accuracy. In this paper, w...
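As a rough illustration of the block floating point idea (a sketch of the general technique, not the paper's specific format), each block of values shares a single exponent while the individual values keep only short fixed-point mantissas; the block size and 8-bit mantissa width below are assumptions:

```python
import numpy as np

def to_bfp(block, mant_bits=8):
    """Block floating point: one shared exponent per block, fixed-point mantissas."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block, dtype=np.int32), 0
    # Shared exponent chosen so the largest value fits in mant_bits signed bits.
    exp = int(np.ceil(np.log2(max_abs))) - (mant_bits - 1)
    lo, hi = -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1
    mant = np.clip(np.round(block / 2.0 ** exp), lo, hi).astype(np.int32)
    return mant, exp

def from_bfp(mant, exp):
    """Reconstruct the floating point values from mantissas and the shared exponent."""
    return mant.astype(np.float32) * 2.0 ** exp

block = np.random.randn(16).astype(np.float32)
mant, exp = to_bfp(block)
print("max quantization error:", np.abs(from_bfp(mant, exp) - block).max())
```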
PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks
Convolutional neural networks (CNNs) have been widely employed in many applications such as image classification, video analysis, and speech recognition. Being compute-intensive, CNN computations are mainly accelerated by GPUs with high power dissipation. Recently, studies have explored FPGAs as CNN accelerators because of their reconfigurability and energy-efficiency advantage over GP...